Automatic Clustering Algorithms

Automatic clustering algorithms are algorithms that can perform clustering without prior knowledge of data sets. In contrast with other cluster analysis techniques, automatic clustering algorithms can determine the optimal number of clusters even in the presence of noise and outlier points.


Centroid-based

Given a set of ''n'' objects, centroid-based algorithms create ''k'' partitions based on a dissimilarity function, such that ''k'' ≤ ''n''. A major problem in applying this type of algorithm is determining the appropriate number of clusters for unlabeled data, so much research in cluster analysis has focused on automating this choice. Automated selection of ''k'' in ''k''-means clustering, one of the most widely used centroid-based clustering algorithms, is still a major problem in machine learning. The most widely accepted solution is the elbow method. It consists of running ''k''-means clustering on the data set for a range of values of ''k'', calculating the sum of squared errors for each, and plotting the results in a line chart. If the chart looks like an arm, the best value of ''k'' lies at the "elbow". Another method that modifies the ''k''-means algorithm to choose the optimal number of clusters automatically is the ''G''-means algorithm. It is based on the hypothesis that the data assigned to each center follows a Gaussian distribution: ''k'' is increased until the data belonging to each ''k''-means center passes a Gaussian test. This algorithm requires only a standard statistical significance level as a parameter and places no limits on the covariance of the data.
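A minimal sketch of the elbow method using scikit-learn's ''KMeans'' is shown below; the synthetic data set, the candidate range of ''k'', and the plotting choices are illustrative assumptions, not part of the method itself.

<syntaxhighlight lang="python">
# Elbow method sketch: run k-means for several values of k, record the
# sum of squared errors (inertia), and look for the "elbow" in the curve.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data

k_values = range(1, 11)
sse = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # sum of squared distances to the nearest center

plt.plot(list(k_values), sse, marker="o")
plt.xlabel("k")
plt.ylabel("sum of squared errors")
plt.show()  # the bend ("elbow") in this curve suggests a value of k
</syntaxhighlight>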


Connectivity-based (hierarchical clustering)

Connectivity-based clustering, or hierarchical clustering, is based on the idea that objects have more similarities to nearby objects than to those further away. Therefore, the clusters generated by this type of algorithm are the result of the distances between the analyzed objects. Hierarchical models can be either divisive, where partitions are built top-down from the entire data set, or agglomerative, where each partition begins with a single object and additional objects are added to the set. Although hierarchical clustering has the advantage of allowing any valid metric to be used as the distance function, it is sensitive to noise and fluctuations in the data set and is more difficult to automate. Methods have been developed to improve and automate existing hierarchical clustering algorithms, such as an automated version of single-linkage hierarchical cluster analysis (HCA). This computerized method bases its success on a self-consistent outlier-reduction approach followed by the construction of a descriptive function that permits defining natural clusters. Discarded objects can also be assigned to these clusters, so one need not resort to external parameters to identify natural clusters. The information gathered from this automated and reliable HCA can be summarized in a dendrogram showing the number of natural clusters and the corresponding separation, an option not found in classical HCA. The method consists of two steps: removing outliers (a step applied in many filtering applications) and an optional classification that expands the clusters to the whole set of objects.
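As a simplified illustration of connectivity-based clustering (not the automated outlier-reduction method described above), the following sketch builds a single-linkage hierarchy with SciPy and cuts the resulting tree at an assumed distance threshold to obtain clusters.

<syntaxhighlight lang="python">
# Single-linkage HCA sketch: build the hierarchy, cut it at a distance
# threshold, and summarize it in a dendrogram. The threshold value is an
# illustrative assumption, not a data-driven choice.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)  # toy data

Z = linkage(X, method="single")                      # single-linkage hierarchy
labels = fcluster(Z, t=1.5, criterion="distance")    # cut the tree at distance 1.5
print("clusters found:", len(np.unique(labels)))

dendrogram(Z, no_labels=True)                        # visual summary of the hierarchy
plt.show()
</syntaxhighlight>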
BIRCH (balanced iterative reducing and clustering using hierarchies) is an algorithm used to perform connectivity-based clustering on large data sets. It is regarded as one of the fastest clustering algorithms, but it is limited in that it requires the number of clusters as an input. Therefore, new algorithms based on BIRCH have been developed that do not need the cluster count up front while preserving the quality and speed of the clustering. The main modification, referred to as tree-BIRCH, removes the final step of BIRCH, where the user had to input the cluster count, and improves the rest of the algorithm by optimizing a threshold parameter from the data. In the resulting algorithm, the threshold parameter is calculated from the maximum cluster radius and the minimum distance between clusters, which are often known. This method proved efficient for data sets of up to tens of thousands of clusters; beyond that scale, a supercluster-splitting problem appears. To address it, other algorithms have been developed, such as MDB-BIRCH, which reduces supercluster splitting while remaining relatively fast.
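scikit-learn's ''Birch'' estimator accepts ''n_clusters=None'', which skips the final global clustering step, so the number of subclusters is governed only by the threshold parameter. The sketch below uses this option; the threshold value here is an arbitrary illustration rather than the data-driven choice described above.

<syntaxhighlight lang="python">
# BIRCH sketch without a preset cluster count: n_clusters=None disables the
# final global clustering step, so only the threshold controls how many
# subclusters emerge. The threshold used here is an illustrative guess.
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, cluster_std=0.5, random_state=2)

birch = Birch(threshold=0.8, branching_factor=50, n_clusters=None)
labels = birch.fit_predict(X)
print("subclusters found:", len(birch.subcluster_centers_))
</syntaxhighlight>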


Density-based

Unlike partitioning and hierarchical methods, density-based clustering algorithms can find clusters of arbitrary shape, not only spheres. A density-based clustering algorithm uses autonomous machine learning that identifies patterns in terms of geographical location and distance to a particular number of neighbors. It is considered autonomous because a priori knowledge of what constitutes a cluster is not required. This type of algorithm provides different methods to find clusters in the data. The fastest method is DBSCAN, which uses a defined distance to differentiate between dense groups of information and sparser noise. HDBSCAN can self-adjust by using a range of distances instead of a single specified one. Lastly, OPTICS creates a reachability plot based on the distances from neighboring features to separate noise from clusters of varying density. These methods still require user-supplied parameters and cannot be considered fully automatic. The Automatic Local Density Clustering Algorithm (ALDC) is an example of newer research focused on developing automatic density-based clustering. ALDC works out the local density and distance deviation of every point, thereby expanding the difference between the potential cluster centers and other points. This expansion allows the machine to work automatically: it identifies cluster centers and assigns the remaining points to their closest neighbor of higher density.
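The sketch below contrasts DBSCAN and OPTICS using scikit-learn; ''eps'' and ''min_samples'' are user-supplied parameters chosen for this example, which is precisely why such methods are not considered fully automatic.

<syntaxhighlight lang="python">
# DBSCAN versus OPTICS sketch on a non-spherical toy data set.
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.08, random_state=3)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)           # fixed neighborhood radius
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("DBSCAN clusters (noise excluded):", n_clusters)

opt = OPTICS(min_samples=5).fit(X)                   # uses a range of distances
reachability = opt.reachability_[opt.ordering_]      # basis of the reachability plot
print("reachability values computed:", len(reachability))
</syntaxhighlight>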
In the automation of data density to identify clusters, research has also focused on artificially generating the algorithms themselves. For instance, estimation of distribution algorithms (EDAs) guarantee the generation of valid algorithms through a directed acyclic graph (DAG), in which nodes represent procedures (building blocks) and edges represent possible execution sequences between two nodes. The building blocks determine the EDA's alphabet and hence any generated algorithm. In experimental results, the artificially generated clustering algorithms are compared with DBSCAN, a manually constructed algorithm.

